Take Home Exercise 1

Visual Analytics of the demographic and financial characteristics of residents in City of Engagement

Author

Oh Jia Wen

Published

May 6, 2023

Modified

May 6, 2023

1. OVERVIEW

City of Engagement, with a total population of 50,000, is a small city located at Country of Nowhere. The city serves as a service centre of an agriculture region surrounding the city. The main agriculture of the region is fruit farms and vineyards. The local council of the city is in the process of preparing the Local Plan 2023. A sample survey of 1000 representative residents had been conducted to collect data related to their household demographic and spending patterns, among other things. The city aims to use the data to assist with their major community revitalization efforts, including how to allocate a very large city renewal grant they have recently received.

1.1 The Task

In this take-home exercise, you are required to apply the concepts and methods you had learned in Lesson 1-4 to reveal the demographic and financial characteristics of the city of Engagement, using appropriate static and interactive statistical graphics methods.

2. Data Source

3. Data Preparation

3.1 Install R-packages

Using p_load() of pacman package to load the required libraries

pacman::p_load(ggiraph, plotly, patchwork, DT, tidyverse,knitr,FunnelPlotR,scales) 

3.2 Import Data

3.2.1 Import participants dataset

participants <- read_csv("data/Participants.csv")

3.2.2 Load participants

# A tibble: 6 × 7
  participantId householdSize haveKids   age educationLevel      interestGroup
          <dbl>         <dbl> <lgl>    <dbl> <chr>               <chr>        
1             0             3 TRUE        36 HighSchoolOrCollege H            
2             1             3 TRUE        25 HighSchoolOrCollege B            
3             2             3 TRUE        35 HighSchoolOrCollege A            
4             3             3 TRUE        21 HighSchoolOrCollege I            
5             4             3 TRUE        43 Bachelors           H            
6             5             3 TRUE        32 HighSchoolOrCollege D            
# ℹ 1 more variable: joviality <dbl>
head(participants)

3.2.3 Import Financial Journal dataset

financial_journal <- read_csv("data/FinancialJournal.csv")

3.2.4 Load Financial Journal

# A tibble: 6 × 4
  participantId timestamp           category  amount
          <dbl> <dttm>              <chr>      <dbl>
1             0 2022-03-01 00:00:00 Wage      2473. 
2             0 2022-03-01 00:00:00 Shelter   -555. 
3             0 2022-03-01 00:00:00 Education  -38.0
4             1 2022-03-01 00:00:00 Wage      2047. 
5             1 2022-03-01 00:00:00 Shelter   -555. 
6             1 2022-03-01 00:00:00 Education  -38.0
head(financial_journal)

3.3 Data Wrangling

As seen from the two data tables, there are inaccurate data types. Code with mutate from dplyr to reformat participantID from dbl to chr. groupby participantID remove duplicates reformat timestamp to year and month create new variables for income, expenses, cashflow

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame.

unique(financial_journal$category)
[1] "Wage"           "Shelter"        "Education"      "RentAdjustment"
[5] "Food"           "Recreation"    
#create new dataset 
participants_new <- participants %>%
    mutate(
          participantId = as.character(participantId))

#reformat householdSize to Ordinal 
    participants_new$householdSize <- factor(participants$householdSize,
                                      levels = c("1", "2", "3"), 
                                      ordered = TRUE) 
#reformat age group 
participants_new$age_group <- factor(ifelse(participants$age < 20, "Under 20",
                ifelse(participants$age < 30, "20-29",
                    ifelse(participants$age < 40, "30-39",
                      ifelse(participants$age < 50, "40-49", "Above 50")))),
                levels = c("Under 20", "20-29", "30-39", "40-49", "Above 50"),
                ordered= TRUE)

#reformat education level to Ordinal 
participants_new$educationLevel <- factor(participants$educationLevel, 
                                      levels = c("Low", "Graduate", "Bachelors",
                                                 "HighSchoolOrCollege"), 
                                      ordered = TRUE)  

#round up joviality to 2 decimal places 
participants_new$joviality <- round(participants$joviality, 2) 
  
participants_new
# A tibble: 1,011 × 8
   participantId householdSize haveKids   age educationLevel      interestGroup
   <chr>         <ord>         <lgl>    <dbl> <ord>               <chr>        
 1 0             3             TRUE        36 HighSchoolOrCollege H            
 2 1             3             TRUE        25 HighSchoolOrCollege B            
 3 2             3             TRUE        35 HighSchoolOrCollege A            
 4 3             3             TRUE        21 HighSchoolOrCollege I            
 5 4             3             TRUE        43 Bachelors           H            
 6 5             3             TRUE        32 HighSchoolOrCollege D            
 7 6             3             TRUE        26 HighSchoolOrCollege I            
 8 7             3             TRUE        27 Bachelors           A            
 9 8             3             TRUE        20 Bachelors           G            
10 9             3             TRUE        35 Bachelors           D            
# ℹ 1,001 more rows
# ℹ 2 more variables: joviality <dbl>, age_group <ord>
#check min and max age of residents in COE. 
min(participants$age)
[1] 18
max(participants$age)
[1] 60
#remove duplicate rows for all columns
financial_journal_lessdup <- financial_journal %>% 
  distinct()

You can use group_by() function along with the summarise() from dplyr package to find the group by sum in R DataFrame, group_by() returns the grouped_df ( A grouped Data Frame) and use summarise() on grouped df results to get the group by sum.

scales package (part of the Tidyverse) does exactly this:

Show the code
#create new dataset 
grouped_data <- financial_journal_lessdup %>%
  
#recode ID from dbl to chr, year_mth
    mutate(participantId = as.character(participantId),
         year_mth = format(as.Date(financial_journal_lessdup$timestamp), "%Y-%m"),
         amount = abs(round(amount,2)),
         .before = 3) %>%
  
#group the columns in the following order 
  group_by(participantId,year_mth, category) %>%
  summarize(total_amount = sum(amount)) 

# Pivot the data frame to have categories as columns
pivoted_fj <- grouped_data %>%
  pivot_wider(names_from = "category", values_from = "total_amount", values_fill = 0)

# Add a new column with mixed categories
pivoted_fj$Expenses <- pivoted_fj$Education + pivoted_fj$Food + pivoted_fj$Recreation + pivoted_fj$Shelter +pivoted_fj$RentAdjustment
pivoted_fj$Income <- pivoted_fj$Wage
pivoted_fj$Cashflow <- pivoted_fj$Income - pivoted_fj$Expenses
pivoted_fj$Shelter <- pivoted_fj$Shelter + pivoted_fj$RentAdjustment

# Output the pivoted data frame
pivoted_fj
# A tibble: 10,691 × 11
# Groups:   participantId, year_mth [10,691]
   participantId year_mth Education  Food Recreation Shelter   Wage
   <chr>         <chr>        <dbl> <dbl>      <dbl>   <dbl>  <dbl>
 1 0             2022-03       38.0  268.      349.     555. 11932.
 2 0             2022-04       38.0  266.      219.     555.  8637.
 3 0             2022-05       38.0  265.      383.     555.  9048.
 4 0             2022-06       38.0  257.      466.     555.  9048.
 5 0             2022-07       38.0  270.     1069.     555.  8637.
 6 0             2022-08       38.0  262.      314.     555.  9459.
 7 0             2022-09       38.0  256.      295.     555.  9048.
 8 0             2022-10       38.0  267.       25.0    555.  8637.
 9 0             2022-11       38.0  261       377.     555.  9048.
10 0             2022-12       38.0  266.      357.     555.  9048.
# ℹ 10,681 more rows
# ℹ 4 more variables: RentAdjustment <dbl>, Expenses <dbl>, Income <dbl>,
#   Cashflow <dbl>

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved.

colSums(pivoted_fj[-1] !=0)
      year_mth      Education           Food     Recreation        Shelter 
         10691           3018          10691           9492          10560 
          Wage RentAdjustment       Expenses         Income       Cashflow 
         10691             72          10691          10691          10691 

Check for missing values

#Check for missing values
any(is.na(participants_new))
[1] FALSE
any(is.na(pivoted_fj))
[1] FALSE

Merge Data Table

#join both data sets 
resident_profile <- full_join(participants_new, pivoted_fj, 
                       by = c("participantId" = "participantId")) %>%
#relocate columns to the front (by importance)
                    relocate(year_mth, .after =participantId) %>%
                    relocate(Cashflow, .after = year_mth) %>%
                    relocate(age_group, .after = Cashflow) %>%
                    relocate(educationLevel, .after = age_group) %>%
                    relocate(Income, .after = haveKids) %>%
                    relocate(Expenses , .after = Income) 
resident_profile %>%
    select(c(1:17))
# A tibble: 10,691 × 17
   participantId year_mth Cashflow age_group educationLevel      householdSize
   <chr>         <chr>       <dbl> <ord>     <ord>               <ord>        
 1 0             2022-03    10722. 30-39     HighSchoolOrCollege 3            
 2 0             2022-04     7559. 30-39     HighSchoolOrCollege 3            
 3 0             2022-05     7808. 30-39     HighSchoolOrCollege 3            
 4 0             2022-06     7733. 30-39     HighSchoolOrCollege 3            
 5 0             2022-07     6704. 30-39     HighSchoolOrCollege 3            
 6 0             2022-08     8291. 30-39     HighSchoolOrCollege 3            
 7 0             2022-09     7904. 30-39     HighSchoolOrCollege 3            
 8 0             2022-10     7752. 30-39     HighSchoolOrCollege 3            
 9 0             2022-11     7817. 30-39     HighSchoolOrCollege 3            
10 0             2022-12     7832. 30-39     HighSchoolOrCollege 3            
# ℹ 10,681 more rows
# ℹ 11 more variables: haveKids <lgl>, Income <dbl>, Expenses <dbl>, age <dbl>,
#   interestGroup <chr>, joviality <dbl>, Education <dbl>, Food <dbl>,
#   Recreation <dbl>, Shelter <dbl>, Wage <dbl>
DT::datatable(resident_profile, class= "compact")

4. Demographics Insights

#Bar chart for resident's age distribution
p1 <- ggplot(data= participants_new,
      aes(x = age_group)) +
      geom_bar() +
      xlab("Age Group") +
      ylab("No.of\nResidents") +
      ylim(0,300) +
      geom_text(aes(label = paste(..count..,",", scales::percent(..count../sum(..count..),accuracy = 0.01))), 
      stat= "count", vjust = -0.5) +
      ggtitle("Age Distribution of Residents ") +
      theme(plot.title = element_text(hjust = 0.5))

#Bar chart for resident's household size distribution
p2 <- ggplot(data= participants_new,
      aes(x = householdSize)) +
      geom_bar() +
      xlab("Size of Household") +
      ylab("No.of\nResidents") +
      ylim(0,400) +
      geom_text(aes(label = paste(..count..,",", scales::percent(..count../sum(..count..),accuracy = 0.01))), 
      stat= "count", vjust = -0.5) +
      ggtitle("Household Size Distribution of Residents ") +
      theme(plot.title = element_text(hjust = 0.5))

#Bar chart for resident's education level
p3 <- ggplot(data= participants_new,
      aes(x = educationLevel)) +
      geom_bar() +
      xlab("Education Level") +
      ylab("No.of\nResidents") +
      ylim(0,600) +
      geom_text(aes(label = paste(..count..,",", scales::percent(..count../sum(..count..),accuracy = 0.01))), 
      stat= "count", vjust = -0.5) +
      ggtitle("Distribution of Residents' Education Level  ") +
      theme(plot.title = element_text(hjust = 0.5))

#Bar chart to visualize if residents have kids 
p4 <- ggplot(data= participants_new,
      aes(x = haveKids)) +
      geom_bar() +
      xlab("Education Level") +
      ylab("No.of\nResidents") +
      ylim(0,850) +
      geom_text(aes(label = paste(..count..,",", scales::percent(..count../sum(..count..),accuracy = 0.01))), 
      stat= "count", vjust = -0.5) +
      ggtitle("Residents' Kids Distribution") +
      theme(plot.title = element_text(hjust = 0.5))

#Bar chart for resident's joviality 
#MISSING THE TEXT ABOVE 

p5 <- ggplot(data= participants_new,
      aes(x = joviality)) +
      geom_histogram(bins=10, color="black",      
                 fill= "darkgrey") +
      xlab("Joviality Level") +
      ylab("No.of\nResidents") +
      ggtitle("Distribution of Residents' Joviality  ") +
      theme(plot.title = element_text(hjust = 0.5))

#Bar chart for residents' interest group 
p6 <- ggplot(data= participants_new,
      aes(x = interestGroup)) +
      geom_bar() +
      xlab("Education Level") +
      ylab("No.of\nResidents") +
      ylim(0,150) +
      geom_text(aes(label = paste(scales::percent(..count../sum(..count..),
            accuracy = 0.01))), 
            stat= "count", vjust = -0.5) +
      ggtitle("Distribution of Residents' Education Level  ") +
      theme(plot.title = element_text(hjust = 0.5))

p1 +p2 + p3 + p4 + p5 + p6  +  plot_layout(ncol = 3, widths = c(1))


::: {.cell}

```{.r .cell-code}
##KIV TO COME BACK AND EDIT. 
p3 <- ggplot(data= resident_profile,
      aes(x = Income)) +
      geom_histogram(bins=30,            
                 color="black",      
                 fill= "darkgrey") +
      xlab("Income") +
      ylab("No.of\nResidents") +
      ylim(0,2100) +
      xlim(-1000,20000) +
      ggtitle("Income Distribution of Residents ") +
      theme(plot.title = element_text(hjust = 0.5))
p3

:::

5. Financial Health Insights

plot_ly(data = resident_profile, 
             x = ~joviality, y = ~Income)
p <- ggplot(data=resident_profile, 
       aes(x = Shelter)) +
  geom_dotplot_interactive(              
    aes(tooltip = Expenses, 
        data_id = Expenses),              
    stackgroups = TRUE,                  
    binwidth = 1,                        
    method = "histodot") +               
  scale_y_continuous(NULL,               
                     breaks = NULL)
girafe(                                  
  ggobj = p,                             
  width_svg = 6,                         
  height_svg = 6*0.618,
  options = list(                        
    opts_hover(css = "fill: #202020;"),  
    opts_hover_inv(css = "opacity:0.2;") 
  )                                        
)                                        
average_income <- round(mean(resident_profile$Income),2)
median_income <-round(median(resident_profile$Income),2)
ymax <- as.numeric(round((IQR(resident_profile$Income)*1.5) +
                quantile(resident_profile$Income,0.75)))
ymin <- as.integer(min(resident_profile$Income))

average_income
[1] 4265.05
median_income
[1] 3613.94
ymax
[1] 9110
ymin
[1] 1600
b <- ggplot(data = pivoted_fj, aes(y = Income)) + 
  geom_boxplot(outlier.colour="red", outlier.shape=16,
               outlier.size=1, notch=FALSE) + 
  coord_flip() + labs(y = "", x = "") + 
  scale_y_continuous(limits = c(0,20000), labels = scales::comma) +
  theme(axis.text = element_blank(), axis.ticks = element_blank()) + 
  stat_boxplot(geom="errorbar", width=0.5) + 
  annotate("text", x=0.35, y=ymax, label=format(ymax, big.mark = ","), 
           size=3, color="lightpink4") +
  annotate("text", x=0.35, y=ymin, label=format(ymin, big.mark = ","), 
           size=3, color="lightpink4")

b

#plotting histogram
h <- ggplot(data = resident_profile, 
            aes(x = Income)) + 
  geom_histogram(color="black", fill="azure4", binwidth = 50000) + 
  scale_x_continuous(limits = c(0,1500000), labels = scales::comma) +
  labs(x = "Resale Price (SGD)", y = "Number of transactions") +
  geom_vline(aes(xintercept = resale_mean), col="darkblue", linewidth=1) +
  annotate("text", x=640000, y=4000, label="Mean resale price:", 
           size=4, color="darkblue") +
  annotate("text", x=640000, y=3750, label=format(average_income, big.mark = ","),
           size=4, color="darkblue") +
  geom_vline(aes(xintercept = resale_median), col="lightpink4", linewidth=1) +
  annotate("text", x=400000, y=4000, label="Median resale price", 
           size=4, color="lightpink4") +
  annotate("text", x=400000, y=3750, label=format(median_income, big.mark = ","),
           size=4, color="lightpink4") +
  theme(axis.text.x = element_text(size=8))
funnel_plot(
  numerator = resident_profile$Income,
  denominator = resident_profile$Expenses,
  group = resident_profile$age
  
)

A funnel plot object with 43 points of which 0 are outliers. 
Plot is adjusted for overdispersion. 
#Initiating the base plot
plot_ly(data = resident_profile,
        x = ~joviality,
        y = ~age,
        color = ~haveKids,
        hovertemplate = ~paste("<br>ID", participantId,
                               "<br>Education Level:", educationLevel,
                               "<br>Household Size ", householdSize),
        type = 'scatter',
        mode = 'markers',
        marker = list(opacity = 0.6,
                      sizemode = 'diameter',
                      line = list(width = 0.2, color = 'white')))